Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure proper shutdown of libbeat #1075

Merged
merged 1 commit into from
Mar 4, 2016
Merged

Conversation

McStork
Copy link
Contributor

@McStork McStork commented Mar 1, 2016

Libbeat publisher now exits once all async events are published, all
events outputed and ConnectionModes closed.

  • Move WorkerSignal to libbeat/common: WorkerSignals are used in packages publisher and outputs
  • Add a second WaitGroup to WorkerSignal. WaitGroups are used to track events and shutdown of goroutines
  • Remove TestMessageWorkerStopQueue and stopQueue function in WorkerSignal: events are not considered
    failed at shutdown anymore
  • Adapt existing Libbeat tests
  • Add a few tests for published events after shutdown
  • Add method Close to outputs/Outputer. The method is called by publisher/outputWorker onStop()
  • Edit CHANGELOG

@McStork
Copy link
Contributor Author

McStork commented Mar 1, 2016

Could not reopen previous PR due to forced push

@elasticsearch-release
Copy link

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

@ruflin
Copy link
Contributor

ruflin commented Mar 1, 2016

@McStork Could this also resolve #1074 ?

@McStork
Copy link
Contributor Author

McStork commented Mar 1, 2016

@ruflin No, I doubt it will help fix this race issue :(
Seems like the race is deep in ProtocolClient.

return false
}
ch <- msg
return true
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this channel might block. Why remove the Done channel here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it should have the same behavior with or without this case <-done because here we make sure to output all events received.
The publisher being closed, balance forwardEvent() will not be called once we reach WGevent.Wait().

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

problem is, if output is unresponsive (e.g. infinite retry + network issues), the publisher will basically hang on shutdown doing no progress.

@McStork McStork force-pushed the libbeat-proper-shut branch from 485186e to 9cc4581 Compare March 4, 2016 08:54
…down)

Libbeat publisher now exits once all async events are published, all
events outputed and ConnectionModes closed.

* Move WorkerSignal to libbeat/common: WorkerSignals are used in packages publisher and outputs
* Add a second WaitGroup to WorkerSignal. WaitGroups are used to track events and shutdown of Go routines
* Add methods in WorkerSignal that wrap WaitGroups methods call
* Remove TestMessageWorkerStopQueue and stopQueue function in WorkerSignal: events are not considered
failed at shutdown anymore
* Add tests for published events after shutdown
* Add method Close to outputs/Outputer. The method is called by publisher/outputWorker onStop()
* Edit CHANGELOG
@McStork McStork force-pushed the libbeat-proper-shut branch from 9cc4581 to 7f16e11 Compare March 4, 2016 09:23
@McStork
Copy link
Contributor Author

McStork commented Mar 4, 2016

Squashed and rebased after review changes

@urso
Copy link

urso commented Mar 4, 2016

LGTM.

Problem with this approach is, beats won't shutdown if output is unresponsive. But this is already the case, no regression here.

In future I'd like to have publisher.Client return a new client instance connecting to the publisher pipeline, which is closable.

So beats startup goes like:

  1. start publisher pipeline
  2. initialize beat itself
  3. create (connect) client to pipeline

On shutdown we can disconnect from pipeline, so publisher pipeline and beats can shutdown in parallel. For filebeat the outputs.Signaler shall get a new event 'Stopped', so filebeat signal handler can differentiate between Fail and Shutdown handling for proper registry update on shutdown. With this change beats don't have to wait for pipeline being empty on shutdown, as pipeline is just dropped as is (events will be lost, yes). In addition libbeat might get support for internal persistent queue, to not drop events on shutdown (or in general).

urso pushed a commit that referenced this pull request Mar 4, 2016
Ensure proper shutdown of libbeat
@urso urso merged commit 6d6cd52 into elastic:master Mar 4, 2016
@McStork
Copy link
Contributor Author

McStork commented Mar 4, 2016

@urso Thanks for giving your insight.

If I understand correctly, with a feature such as a persistent queue, a good solution would be to save unpublished events to disk at shutdown. A persistent queue would certainly help when getting stuck because of unresponsive output.

But is it be necessary do like this 100% of the time? What if:

  1. If at shutdown the output gets stuck, events are on the disk anyway so none are lost and the program closes.
  2. If at shutdown the output still forward events, wait for it to empty all events (or forward them until we reach a situation such as 1.)

The thing would be to measure the responsiveness of the output to know if we are in situation 1 or 2.

@McStork McStork deleted the libbeat-proper-shut branch March 6, 2016 10:40
@McStork
Copy link
Contributor Author

McStork commented Mar 6, 2016

I think WorkerSignal can help solve the problem of output blocking the shutdown of a beat.
WorkerSignal could have a StopState channel that notify the connection modes that the beat is in shutdown mode.
StopState is closed when calling WorkerSignal.Stop(). The connection mode knowing it is in shutdown state, will drop events when he meets a failed connection/publishing attempts. WorkerSignal then ends without blocking.

I tried this today and it seems like it could work well :).

@McStork
Copy link
Contributor Author

McStork commented Mar 6, 2016

The StopState channel doesn't have to be in WorkerSignal. It can just be in PublisherType and closed before calling pub.wsOutput.Stop().

@urso
Copy link

urso commented Mar 6, 2016

If I understand correctly, with a feature such as a persistent queue, a good solution would be to save unpublished events to disk at shutdown. A persistent queue would certainly help when getting stuck because of unresponsive output.

Persistent queue is not about shutdown only, but more about resiliency. Having configured a persistent queue all events must pass the queue and only once ACKed by the queue, events will be published (with send-at-leat-once semantics).

Thing is, there is no paticular need to add too much logic for handling shutdown. For most beats it's fine to drop data. Beats not being fine with event loss (e.g. filebeat) remember where they left off and can restart from last location. No problem if queue is dropped. Still it's of interest to signal 'shutting' down events to filebeat to do one last registry update.

The thing would be to measure the responsiveness of the output to know if we are in situation 1 or 2.

I won't even try. Just drop events in pipeline and continue where left off. Filebeat/Winlogbeat or persistent queue remember where they left off and resend events not ACKed by output plugin.

The StopState channel doesn't have to be in WorkerSignal ...

Beats can already provide callback to PublishEvent (type output.Signaler) for getting Success/Fail events. This is used by filebeat async publisher mode:

Current problem is I can not distinguish between failure (which should only be possible during shutdown) or shutdown. Adding a shutdown callback we could do a 'panic' in 'Failed' handler.

Will be in vacation for next 2 weeks. Let's continue discussion thereafter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants